% Credits: Some of these materials were adapted from Software Carpentry
For some, it might just be basic calculations
63.24 * pi # Multiply 63.24 by pi
## [1] 198.6743
exp(x = 4.39) # Raise e to the power of 4.39
## [1] 80.64042
log(x = 1.7) # Take the log of 1.7
## [1] 0.5306283
tan(x = 58) # Compute the tangent of 58
## [1] 8.330857
For others, it might be large or complex mathematical operations
# Take one million samples from the standard normal distribution
data.sample <- rnorm(n=1000000, mean=0, sd=1)
# Build a 1000 x 1000 matrix from the sample data
big.matrix <- matrix(data=data.sample, ncol=1000)
dim(x = big.matrix) # Confirm that "big.matrix" is 1000 x 1000
## [1] 1000 1000
big.matrix.inverse <- solve(a=big.matrix) # Compute the inverse of "big.matrix"
system.time(expr = solve(a=big.matrix)) # Compute time required to invert "big.matrix"
## user system elapsed
## 0.365 0.515 0.170
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003)
For most applied researchers, “useful stuff” that can be done in R boils down to a few core items:
So far, you’ve seen the basics of manipulating data.frames; now, let’s use those skills to digest a more realistic dataset.
For this unit, we'll be working with the “Gapminder” dataset, which is excerpt of the data available at Gapminder.org. For each of 142 countries, the data provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.
gapminder <- read.csv("../data/gapminder-FiveYearData.csv", stringsAsFactors = T)
str(gapminder)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
So far, you’ve seen the basics of manipulating data frames, e.g. subsetting, merging, and basic calculations. For instance, we can use base R functions to calculate summary statistics across groups of observaitons:
mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
## [1] 2193.755
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
## [1] 7136.11
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
## [1] 7902.15
But this isn't ideal because it involves a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.
Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes. These functions will save you time by reducing repetition. As an added bonus, you might even find the dplyr grammar easier to read.
Here we're going to cover 6 of the most commonly used functions as well as using pipes (%>%) to combine them.
select()filter()group_by()summarize()mutate()arrange()If you have have not installed this package earlier, please do so now:
# not run
# install.packages('dplyr')
Now let's load the package:
library(dplyr)
Imagine that we just received the gapminder dataset, but are only interested in a few variables in it. We could use the select() function to keep only the variables we select.
year_country_gdp <- select(gapminder, year, country, gdpPercap)
If we open up year_country_gdp, we'll see that it only contains the year, country and gdpPercap. This is equivalent to the base R subsetting function:
year_country_gdp <- gapminder[,c("year", "country", "gdpPercap")]
But, as we will see, dplyr makes for much more readible, efficient code because of its pipe operator.
Above, we used what's called 'normal' grammar, but the strengths of dplyr lie in combining several functions using pipes. Since the pipes grammar is unlike anything we've seen in R before, let's repeat what we've done above using pipes.
year_country_gdp <- gapminder %>% select(year,country,gdpPercap)
Let's walk through it step by step. First we summon the gapminder dataframe and pass it on, using the pipe symbol %>%, to the next step, which is the select() function. In this case we don't specify which data object we use in the select() function since in gets that from the previous pipe. Fun Fact: There is a good chance you have encountered pipes before in the shell. In R, a pipe symbol is %>% while in the shell it is |. But the concept is the same!
Now let's say we're only interested in African countries. We can combine select and filter to select only the observations where continent is Africa.
year_country_gdp_euro <- gapminder %>%
filter(continent=="Africa") %>%
select(year,country,gdpPercap)
As with last time, first we pass the gapminder dataframe to the filter() function, then we pass the filtered version of the gapminder dataframe to the select() function.
To clarify, both the select and filter functions subsets the data frame. The difference is that select extracts certain columns, while filter extracts certain rows.
Note: The order of operations is very important in this case. If we used 'select' first, filter would not be able to find the variable continent since we would have removed it in the previous step.
A common task you'll encounter when working with data is that you'll want to run calculations on different groups within the data.. For instance, what if we wanted to calculated the mean GDP per capita for each continent?
In base R, you would have to run the mean() function for each subset of data:
mean(gapminder$gdpPercap[gapminder$continent=="Africa"])
## [1] 2193.755
mean(gapminder$gdpPercap[gapminder$continent=="Americas"])
## [1] 7136.11
mean(gapminder$gdpPercap[gapminder$continent=="Asia"])
## [1] 7902.15
mean(gapminder$gdpPercap[gapminder$continent=="Europe"])
## [1] 14469.48
mean(gapminder$gdpPercap[gapminder$continent=="Oceania"])
## [1] 18621.61
That's a lot of repetition! To make matters worse, what if we wanted to add these values to our original data frame as a new column? We would have to write something like this:
gapminder$mean.continent.GDP <- NA
gapminder$mean.continent.GDP[gapminder$continent=="Africa"] <- mean(gapminder$gdpPercap[gapminder$continent=="Africa"])
gapminder$mean.continent.GDP[gapminder$continent=="Americas"] <- mean(gapminder$gdpPercap[gapminder$continent=="Americas"])
gapminder$mean.continent.GDP[gapminder$continent=="Asia"] <- mean(gapminder$gdpPercap[gapminder$continent=="Asia"])
gapminder$mean.continent.GDP[gapminder$continent=="Europe"] <- mean(gapminder$gdpPercap[gapminder$continent=="Europe"])
gapminder$mean.continent.GDP[gapminder$continent=="Oceania"] <- mean(gapminder$gdpPercap[gapminder$continent=="Oceania"])
You can see how this can get pretty tedious, especially if we want to calculate more complicated or refined statistics. We could use loops or apply functions, but these can be difficult, slow, or error-prone.
The abstract problem we're encountering here is know as “split-apply-combine”:
We want to split our data into groups (in this case continents), apply some calculations on that group, then combine the results together afterwards. Luckily, dplyr can help us with this problem.
# remove this column -- there's a better way!
gapminder$mean.continent.GDP <- NULL
We've already seen how filter() can help us select observations that meet certain criteria (in the above: continent=="Europe"). More helpful, however, is the group_by() function, which will essentially use every unique criteria that we could have used in filter().
str(gapminder)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
str(gapminder %>% group_by(continent))
## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
## - attr(*, "vars")=List of 1
## ..$ : symbol continent
## - attr(*, "drop")= logi TRUE
## - attr(*, "indices")=List of 5
## ..$ : int 24 25 26 27 28 29 30 31 32 33 ...
## ..$ : int 48 49 50 51 52 53 54 55 56 57 ...
## ..$ : int 0 1 2 3 4 5 6 7 8 9 ...
## ..$ : int 12 13 14 15 16 17 18 19 20 21 ...
## ..$ : int 60 61 62 63 64 65 66 67 68 69 ...
## - attr(*, "group_sizes")= int 624 300 396 360 24
## - attr(*, "biggest_group_size")= int 624
## - attr(*, "labels")='data.frame': 5 obs. of 1 variable:
## ..$ continent: Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
## ..- attr(*, "vars")=List of 1
## .. ..$ : symbol continent
You will notice that the structure of the dataframe where we used group_by() (grouped_df) is not the same as the original gapminder (data.frame). A grouped_df can be thought of as a list where each item in the list is a data.frame which contains only the rows that correspond to the a particular value continent (at least in the example above).
The above was a bit uneventful because group_by() is much more exciting in conjunction with the summarize() function. This will allow use to create new variable(s) by using functions that repeat for each of the continent-specific data frames. In other words, using the group_by() function, we split our original dataframe into multiple pieces, which we then use to run functions (e.g. mean() or sd()) within summarize().
gdp_bycontinents <- gapminder %>%
group_by(continent) %>%
summarize(mean_gdpPercap = mean(gdpPercap))
That allowed us to calculate the mean gdpPercap for each continent. But it gets even better – the function group_by() allows us to group by multiple variables. Let's group by year and continent.
gdp_bycontinents_byyear <- gapminder %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap))
That is already quite powerful, but it gets even better! You're not limited to defining 1 new variable in summarize().
gdp_pop_bycontinents_byyear <- gapminder %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop))
What if we wanted to add these values to our original data frame instead of creating a new object? For this, we can use the mutate() function, which is similar to summarize() except it creates new variables to the same dataframe that you pass into it.
gapminder_with_extra_vars <- gapminder %>%
group_by(continent, year) %>%
mutate(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop))
We can use also use mutate() to create new variables prior to (or even after) summarizing information.
gdp_pop_bycontinents_byyear <- gapminder %>%
mutate(gdp_billion = gdpPercap*pop/10^9) %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop),
mean_gdp_billion = mean(gdp_billion),
sd_gdp_billion = sd(gdp_billion))
As a last step, let's say we want to sort the rows in our data frame according to values in a certain column. We can use the arrange() function to do this. For instance, let's organize our rows by year (recent first), and then by continent.
gapminder_with_extra_vars <- gapminder %>%
group_by(continent, year) %>%
mutate(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop)) %>%
arrange(desc(year), continent)
Even before we conduct analysis or calculations, we need to put our data into the correct format. The goal here is to rearrange a messy dataset into one that is tidy
The two most important properties of tidy data are:
1) Each column is a variable. 2) Each row is an observation.
Tidy data is easier to work with, because you have a consistent way of referring to variables (as column names) and observations (as row indices). It then becomes easy to manipulate, visualize, and model.
For more on the concept of tidy data, read Hadley Wickham's paper here
“Tidy datasets are all alike but every messy dataset is messy in its own way.” – Hadley Wickham
Tabular datasets can be arranged in many ways. For instance, consider the data below. Each data set displays information on heart rate observed in individuals across 3 different time periods. But the data are organized differently in each table.
wide <- data.frame(
name = c("Wilbur", "Petunia", "Gregory"),
time1 = c(67, 80, 64),
time2 = c(56, 90, 50),
time3 = c(70, 67, 101)
)
wide
## name time1 time2 time3
## 1 Wilbur 67 56 70
## 2 Petunia 80 90 67
## 3 Gregory 64 50 101
long <- data.frame(
name = c("Wilbur", "Petunia", "Gregory", "Wilbur", "Petunia", "Gregory", "Wilbur", "Petunia", "Gregory"),
time = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
heartrate = c(67, 80, 64, 56, 90, 50, 70, 67, 10)
)
long
## name time heartrate
## 1 Wilbur 1 67
## 2 Petunia 1 80
## 3 Gregory 1 64
## 4 Wilbur 2 56
## 5 Petunia 2 90
## 6 Gregory 2 50
## 7 Wilbur 3 70
## 8 Petunia 3 67
## 9 Gregory 3 10
Question: Which one of these do you think is the tidy format?
Answer: The first dataframe (the “wide” one) would not be considered tidy because values (i.e., heartrate) are spread across multiple columns.
We often refer to these different structurs as “long” vs. “wide” formats. In the “long” format, you usually have 1 column for the observed variable and the other columns are ID variables.
For the “wide” format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observation of multiple variables (or a mix of both). In the above case, we had the same kind of data (heart rate) entered across 3 different columns, corresponding to three different time periods.
You may find data input may be simpler or some other applications may prefer the “wide” format. However, many of R’s functions have been designed assuming you have “long” format data.
Thankfully, the tidyr package will help you efficiently transform your data regardless of original format.
# Install the "tidyr" package (only necessary one time)
# install.packages("tidyr") # Not Run
# Load the "reshape2" package (necessary every new R session)
library(tidyr)
Until now, we’ve been using the nicely formatted original gapminder dataset, but 'real' data (i.e. our own research data) will never be so well organized. Here let's start with the wide format version of the gapminder dataset.
gap_wide <- read.csv("../data/gapminder_wide.csv", stringsAsFactors = FALSE)
str(gap_wide)
## 'data.frame': 142 obs. of 38 variables:
## $ continent : chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
## $ gdpPercap_1952: num 2449 3521 1063 851 543 ...
## $ gdpPercap_1957: num 3014 3828 960 918 617 ...
## $ gdpPercap_1962: num 2551 4269 949 984 723 ...
## $ gdpPercap_1967: num 3247 5523 1036 1215 795 ...
## $ gdpPercap_1972: num 4183 5473 1086 2264 855 ...
## $ gdpPercap_1977: num 4910 3009 1029 3215 743 ...
## $ gdpPercap_1982: num 5745 2757 1278 4551 807 ...
## $ gdpPercap_1987: num 5681 2430 1226 6206 912 ...
## $ gdpPercap_1992: num 5023 2628 1191 7954 932 ...
## $ gdpPercap_1997: num 4797 2277 1233 8647 946 ...
## $ gdpPercap_2002: num 5288 2773 1373 11004 1038 ...
## $ gdpPercap_2007: num 6223 4797 1441 12570 1217 ...
## $ lifeExp_1952 : num 43.1 30 38.2 47.6 32 ...
## $ lifeExp_1957 : num 45.7 32 40.4 49.6 34.9 ...
## $ lifeExp_1962 : num 48.3 34 42.6 51.5 37.8 ...
## $ lifeExp_1967 : num 51.4 36 44.9 53.3 40.7 ...
## $ lifeExp_1972 : num 54.5 37.9 47 56 43.6 ...
## $ lifeExp_1977 : num 58 39.5 49.2 59.3 46.1 ...
## $ lifeExp_1982 : num 61.4 39.9 50.9 61.5 48.1 ...
## $ lifeExp_1987 : num 65.8 39.9 52.3 63.6 49.6 ...
## $ lifeExp_1992 : num 67.7 40.6 53.9 62.7 50.3 ...
## $ lifeExp_1997 : num 69.2 41 54.8 52.6 50.3 ...
## $ lifeExp_2002 : num 71 41 54.4 46.6 50.6 ...
## $ lifeExp_2007 : num 72.3 42.7 56.7 50.7 52.3 ...
## $ pop_1952 : num 9279525 4232095 1738315 442308 4469979 ...
## $ pop_1957 : num 10270856 4561361 1925173 474639 4713416 ...
## $ pop_1962 : num 11000948 4826015 2151895 512764 4919632 ...
## $ pop_1967 : num 12760499 5247469 2427334 553541 5127935 ...
## $ pop_1972 : num 14760787 5894858 2761407 619351 5433886 ...
## $ pop_1977 : num 17152804 6162675 3168267 781472 5889574 ...
## $ pop_1982 : num 20033753 7016384 3641603 970347 6634596 ...
## $ pop_1987 : num 23254956 7874230 4243788 1151184 7586551 ...
## $ pop_1992 : num 26298373 8735988 4981671 1342614 8878303 ...
## $ pop_1997 : num 29072015 9875024 6066080 1536536 10352843 ...
## $ pop_2002 : int 31287142 10866106 7026113 1630347 12251209 7021078 15929988 4048013 8835739 614382 ...
## $ pop_2007 : int 33333216 12420476 8078314 1639131 14326203 8390505 17696293 4369038 10238807 710960 ...
The first step towards getting our nice tidy data format is to first convert from the wide to the long format.
The function gather() will 'gather' the observation variables into a single variable. This is sometimes called “melting” your data, because it melts the table from wide to long. Those data will be melted into two variables: one for the variable names, and the other for the variable values.
gap_long <- gap_wide %>%
gather(obstype_year, obs_values, 3:38)
str(gap_long)
## 'data.frame': 5112 obs. of 4 variables:
## $ continent : chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
## $ obstype_year: Factor w/ 36 levels "gdpPercap_1952",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ obs_values : num 2449 3521 1063 851 543 ...
Notice that we put 3 arguments into the gather() function:
obstype_year), obs_value), 3:38, signalling columns 3 through 38) that we want to gather into one variable. Notice that we don't want to melt down columns 1 and 2, as these are considered “ID” variables.We can also select observation variables using:
x:z to select all variables between x and z-y to exclude ystarts_with(x, ignore.case = TRUE): all names that starts with xends_with(x, ignore.case = TRUE): all names that ends with xcontains(x, ignore.case = TRUE): all names that contain xSee the select() function in dplyr for more options.
For instance, here we do the same thing with (1) the starts_with function, and (2) the - operator:
# with the starts_with() function
gap_long <- gap_wide %>%
gather(obstype_year, obs_values, starts_with('pop'),
starts_with('lifeExp'), starts_with('gdpPercap'))
str(gap_long)
## 'data.frame': 5112 obs. of 4 variables:
## $ continent : chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
## $ obstype_year: Factor w/ 36 levels "pop_1952","pop_1957",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ obs_values : num 9279525 4232095 1738315 442308 4469979 ...
# with the - operator
gap_long <- gap_wide %>%
gather(obstype_year,obs_values,-continent,-country)
str(gap_long)
## 'data.frame': 5112 obs. of 4 variables:
## $ continent : chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Angola" "Benin" "Botswana" ...
## $ obstype_year: Factor w/ 36 levels "gdpPercap_1952",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ obs_values : num 2449 3521 1063 851 543 ...
However you choose to do it, notice that the output collapses all of the measure variables into two columns: one containing new ID variable, the other containing the observation value for that row.
You'll notice that in our long dataset, obstype_year actually contains 2 pieces of information, the observation type (pop, lifeExp, or gdpPercap) and the year.
We can use the separate() function to split the character strings into multiple variables:
gap_long_sep <- gap_long %>%
separate(obstype_year, into=c('obs_type','year'), sep="_")
gap_long_sep$year <- as.integer(gap_long_sep$year)
The opposite of gather() is spread(). It spreads our observation variables back out to make a wider table We can use this function to spread our gap_long() to the original format.
gap_wide <- gap_long_sep %>%
spread(obs_type, obs_values)
str(gap_wide)
## 'data.frame': 1704 obs. of 6 variables:
## $ continent: chr "Africa" "Africa" "Africa" "Africa" ...
## $ country : chr "Algeria" "Algeria" "Algeria" "Algeria" ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ gdpPercap: num 2449 3014 2551 3247 4183 ...
## $ lifeExp : num 43.1 45.7 48.3 51.4 54.5 ...
## $ pop : num 9279525 10270856 11000948 12760499 14760787 ...
All we need is some quick fixes to make this dataset identical to the original gapminder dataset:
gapminder <- read.csv("../data/gapminder-FiveYearData.csv")
head(gap_wide)
## continent country year gdpPercap lifeExp pop
## 1 Africa Algeria 1952 2449.008 43.077 9279525
## 2 Africa Algeria 1957 3013.976 45.685 10270856
## 3 Africa Algeria 1962 2550.817 48.303 11000948
## 4 Africa Algeria 1967 3246.992 51.407 12760499
## 5 Africa Algeria 1972 4182.664 54.518 14760787
## 6 Africa Algeria 1977 4910.417 58.014 17152804
head(gapminder)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
# rearrange columns
gap_wide <- gap_wide[,names(gapminder)]
head(gap_wide)
## country year pop continent lifeExp gdpPercap
## 1 Algeria 1952 9279525 Africa 43.077 2449.008
## 2 Algeria 1957 10270856 Africa 45.685 3013.976
## 3 Algeria 1962 11000948 Africa 48.303 2550.817
## 4 Algeria 1967 12760499 Africa 51.407 3246.992
## 5 Algeria 1972 14760787 Africa 54.518 4182.664
## 6 Algeria 1977 17152804 Africa 58.014 4910.417
# arrange by country, continent, and year
gap_wide <- gap_wide %>%
arrange(country,continent,year)
head(gap_wide)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
dplyr and tiyr have many more functions to help you wrangle and manipulate your data. See the Data Wrangling Cheat Sheet for more.
Once we've carried out group-wise operations and perhaps reshaped it, we may also like to attempt describing the relationships in the data or conducting some causal inference
This often requires doing the following:
Estimating Regressions
Carryingout Regression Diagnostics
Running regressions in R is extremely simple, very straightforwd (though doing things with standard errors requires a little extra work)
Most basic, catch-all regression function in R is glm
glm fits a generalized linear model with your choice of family/link function (gaussian, logit, poisson, etc.)
lm is just a standard linear regression (equivalent to glm with family=gaussian(link=“identity”))
The basic glm call looks something like this:
glm(formula=y~x1+x2+x3+..., family=familyname(link="linkname"), data=)
There are a bunch of families and links to use (help(family) for a full list), but some essentials are binomial(link = “logit”), gaussian(link = “identity”), and poisson(link = “log”)
Example: suppose we want to regress the life expectency on the GDP per capita and the population, as well as the continent and year. The glm call would be something like this:
# Regress tip percent on total bill and party size
reg <- glm(formula = lifeExp ~ gdpPercap + pop + continent + year,
family=gaussian, data=gapminder)
# View objects contained in the regression output
objects(reg)
## [1] "aic" "boundary" "call"
## [4] "coefficients" "contrasts" "control"
## [7] "converged" "data" "deviance"
## [10] "df.null" "df.residual" "effects"
## [13] "family" "fitted.values" "formula"
## [16] "iter" "linear.predictors" "method"
## [19] "model" "null.deviance" "offset"
## [22] "prior.weights" "qr" "R"
## [25] "rank" "residuals" "terms"
## [28] "weights" "xlevels" "y"
# Examine regression coefficients
reg$coefficients
## (Intercept) gdpPercap pop continentAmericas
## -5.184555e+02 2.984892e-04 1.790640e-09 1.429204e+01
## continentAsia continentEurope continentOceania year
## 9.375486e+00 1.936120e+01 2.055921e+01 2.862583e-01
# Examine regression DoF
reg$df.residual
## [1] 1696
# Examine regression fit (AIC)
reg$aic
## [1] 11420.07
summary(reg)
##
## Call:
## glm(formula = lifeExp ~ gdpPercap + pop + continent + year, family = gaussian,
## data = gapminder)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -28.4051 -4.0550 0.2317 4.5073 20.0217
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.185e+02 1.989e+01 -26.062 <2e-16 ***
## gdpPercap 2.985e-04 2.002e-05 14.908 <2e-16 ***
## pop 1.791e-09 1.634e-09 1.096 0.273
## continentAmericas 1.429e+01 4.946e-01 28.898 <2e-16 ***
## continentAsia 9.375e+00 4.719e-01 19.869 <2e-16 ***
## continentEurope 1.936e+01 5.182e-01 37.361 <2e-16 ***
## continentOceania 2.056e+01 1.469e+00 13.995 <2e-16 ***
## year 2.863e-01 1.006e-02 28.469 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 47.37935)
##
## Null deviance: 284148 on 1703 degrees of freedom
## Residual deviance: 80355 on 1696 degrees of freedom
## AIC: 11420
##
## Number of Fisher Scoring iterations: 2
# Store summary method results
sum.reg <- summary(reg)
# View summary method results objects
objects(sum.reg)
## [1] "aic" "aliased" "call" "coefficients"
## [5] "contrasts" "cov.scaled" "cov.unscaled" "deviance"
## [9] "deviance.resid" "df" "df.null" "df.residual"
## [13] "dispersion" "family" "iter" "null.deviance"
## [17] "terms"
# View table of coefficients
sum.reg$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.184555e+02 1.989299e+01 -26.062215 3.248472e-126
## gdpPercap 2.984892e-04 2.002178e-05 14.908225 2.522143e-47
## pop 1.790640e-09 1.634107e-09 1.095791 2.733256e-01
## continentAmericas 1.429204e+01 4.945645e-01 28.898241 1.183161e-149
## continentAsia 9.375486e+00 4.718629e-01 19.869087 3.798275e-79
## continentEurope 1.936120e+01 5.182170e-01 37.361177 2.025551e-223
## continentOceania 2.055921e+01 1.469070e+00 13.994707 3.390781e-42
## year 2.862583e-01 1.005523e-02 28.468586 4.800797e-146
Note that, in our results, R has broken up our variables into their different factor levels (as it will do whenever your regressors have factor levels)
If your data aren't factorized, you can tell glm to factorize a variable (i.e. create dummy variables on the fly) by writing
glm(formula=y~x1+x2+factor(x3), family=family(link="link"), data=)
x1:x2 interacts all terms in x1 with all terms in x2
summary(glm(formula = lifeExp ~ gdpPercap + pop + continent:factor(year),
family=gaussian, data=gapminder))
##
## Call:
## glm(formula = lifeExp ~ gdpPercap + pop + continent:factor(year),
## family = gaussian, data = gapminder)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -29.5073 -3.4687 0.1739 3.5145 21.1851
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.070e+01 4.745e+00 14.901 < 2e-16
## gdpPercap 3.356e-04 1.997e-05 16.805 < 2e-16
## pop 8.980e-10 1.586e-09 0.566 0.571275
## continentAfrica:factor(year)1952 -3.199e+01 4.831e+00 -6.623 4.77e-11
## continentAmericas:factor(year)1952 -1.881e+01 4.919e+00 -3.823 0.000137
## continentAsia:factor(year)1952 -2.617e+01 4.872e+00 -5.371 8.94e-08
## continentEurope:factor(year)1952 -8.208e+00 4.885e+00 -1.680 0.093146
## continentOceania:factor(year)1952 -4.910e+00 6.668e+00 -0.736 0.461687
## continentAfrica:factor(year)1957 -2.991e+01 4.830e+00 -6.191 7.52e-10
## continentAmericas:factor(year)1957 -1.631e+01 4.918e+00 -3.316 0.000933
## continentAsia:factor(year)1957 -2.337e+01 4.871e+00 -4.797 1.75e-06
## continentEurope:factor(year)1957 -6.351e+00 4.883e+00 -1.301 0.193592
## continentOceania:factor(year)1957 -4.307e+00 6.667e+00 -0.646 0.518393
## continentAfrica:factor(year)1962 -2.793e+01 4.830e+00 -5.782 8.83e-09
## continentAmericas:factor(year)1962 -1.397e+01 4.917e+00 -2.840 0.004564
## continentAsia:factor(year)1962 -2.111e+01 4.871e+00 -4.333 1.56e-05
## continentEurope:factor(year)1962 -4.986e+00 4.880e+00 -1.022 0.307127
## continentOceania:factor(year)1962 -3.886e+00 6.666e+00 -0.583 0.560019
## continentAfrica:factor(year)1967 -2.606e+01 4.829e+00 -5.397 7.75e-08
## continentAmericas:factor(year)1967 -1.221e+01 4.915e+00 -2.484 0.013074
## continentAsia:factor(year)1967 -1.810e+01 4.871e+00 -3.715 0.000210
## continentEurope:factor(year)1967 -4.385e+00 4.877e+00 -0.899 0.368777
## continentOceania:factor(year)1967 -4.265e+00 6.664e+00 -0.640 0.522266
## continentAfrica:factor(year)1972 -2.404e+01 4.828e+00 -4.980 7.03e-07
## continentAmericas:factor(year)1972 -1.051e+01 4.914e+00 -2.138 0.032658
## continentAsia:factor(year)1972 -1.619e+01 4.867e+00 -3.327 0.000899
## continentEurope:factor(year)1972 -4.132e+00 4.874e+00 -0.848 0.396689
## continentOceania:factor(year)1972 -4.311e+00 6.662e+00 -0.647 0.517700
## continentAfrica:factor(year)1977 -2.200e+01 4.828e+00 -4.557 5.58e-06
## continentAmericas:factor(year)1977 -8.800e+00 4.912e+00 -1.791 0.073401
## continentAsia:factor(year)1977 -1.377e+01 4.868e+00 -2.829 0.004722
## continentEurope:factor(year)1977 -3.575e+00 4.871e+00 -0.734 0.463098
## continentOceania:factor(year)1977 -3.657e+00 6.662e+00 -0.549 0.583093
## continentAfrica:factor(year)1982 -1.995e+01 4.828e+00 -4.133 3.77e-05
## continentAmericas:factor(year)1982 -7.017e+00 4.912e+00 -1.428 0.153339
## continentAsia:factor(year)1982 -1.065e+01 4.869e+00 -2.188 0.028825
## continentEurope:factor(year)1982 -3.155e+00 4.870e+00 -0.648 0.517193
## continentOceania:factor(year)1982 -2.649e+00 6.661e+00 -0.398 0.690885
## continentAfrica:factor(year)1987 -1.813e+01 4.828e+00 -3.756 0.000179
## continentAmericas:factor(year)1987 -5.253e+00 4.911e+00 -1.070 0.284987
## continentAsia:factor(year)1987 -8.484e+00 4.869e+00 -1.743 0.081591
## continentEurope:factor(year)1987 -2.855e+00 4.868e+00 -0.587 0.557619
## continentOceania:factor(year)1987 -2.255e+00 6.660e+00 -0.339 0.734934
## continentAfrica:factor(year)1992 -1.785e+01 4.828e+00 -3.697 0.000225
## continentAmericas:factor(year)1992 -3.862e+00 4.911e+00 -0.786 0.431776
## continentAsia:factor(year)1992 -7.151e+00 4.867e+00 -1.469 0.141935
## continentEurope:factor(year)1992 -2.006e+00 4.868e+00 -0.412 0.680292
## continentOceania:factor(year)1992 -7.804e-01 6.659e+00 -0.117 0.906723
## continentAfrica:factor(year)1997 -1.792e+01 4.828e+00 -3.711 0.000213
## continentAmericas:factor(year)1997 -2.565e+00 4.910e+00 -0.522 0.601411
## continentAsia:factor(year)1997 -6.076e+00 4.865e+00 -1.249 0.211930
## continentEurope:factor(year)1997 -1.618e+00 4.866e+00 -0.332 0.739563
## continentOceania:factor(year)1997 -5.865e-01 6.658e+00 -0.088 0.929810
## continentAfrica:factor(year)2002 -1.827e+01 4.827e+00 -3.784 0.000160
## continentAmericas:factor(year)2002 -1.429e+00 4.909e+00 -0.291 0.770984
## continentAsia:factor(year)2002 -4.982e+00 4.865e+00 -1.024 0.305936
## continentEurope:factor(year)2002 -1.307e+00 4.864e+00 -0.269 0.788171
## continentOceania:factor(year)2002 -1.530e-02 6.657e+00 -0.002 0.998167
## continentAfrica:factor(year)2007 -1.695e+01 4.826e+00 -3.512 0.000457
## continentAmericas:factor(year)2007 -8.205e-01 4.906e+00 -0.167 0.867195
## continentAsia:factor(year)2007 -4.265e+00 4.862e+00 -0.877 0.380494
## continentEurope:factor(year)2007 -1.481e+00 4.862e+00 -0.305 0.760680
## continentOceania:factor(year)2007 NA NA NA NA
##
## (Intercept) ***
## gdpPercap ***
## pop
## continentAfrica:factor(year)1952 ***
## continentAmericas:factor(year)1952 ***
## continentAsia:factor(year)1952 ***
## continentEurope:factor(year)1952 .
## continentOceania:factor(year)1952
## continentAfrica:factor(year)1957 ***
## continentAmericas:factor(year)1957 ***
## continentAsia:factor(year)1957 ***
## continentEurope:factor(year)1957
## continentOceania:factor(year)1957
## continentAfrica:factor(year)1962 ***
## continentAmericas:factor(year)1962 **
## continentAsia:factor(year)1962 ***
## continentEurope:factor(year)1962
## continentOceania:factor(year)1962
## continentAfrica:factor(year)1967 ***
## continentAmericas:factor(year)1967 *
## continentAsia:factor(year)1967 ***
## continentEurope:factor(year)1967
## continentOceania:factor(year)1967
## continentAfrica:factor(year)1972 ***
## continentAmericas:factor(year)1972 *
## continentAsia:factor(year)1972 ***
## continentEurope:factor(year)1972
## continentOceania:factor(year)1972
## continentAfrica:factor(year)1977 ***
## continentAmericas:factor(year)1977 .
## continentAsia:factor(year)1977 **
## continentEurope:factor(year)1977
## continentOceania:factor(year)1977
## continentAfrica:factor(year)1982 ***
## continentAmericas:factor(year)1982
## continentAsia:factor(year)1982 *
## continentEurope:factor(year)1982
## continentOceania:factor(year)1982
## continentAfrica:factor(year)1987 ***
## continentAmericas:factor(year)1987
## continentAsia:factor(year)1987 .
## continentEurope:factor(year)1987
## continentOceania:factor(year)1987
## continentAfrica:factor(year)1992 ***
## continentAmericas:factor(year)1992
## continentAsia:factor(year)1992
## continentEurope:factor(year)1992
## continentOceania:factor(year)1992
## continentAfrica:factor(year)1997 ***
## continentAmericas:factor(year)1997
## continentAsia:factor(year)1997
## continentEurope:factor(year)1997
## continentOceania:factor(year)1997
## continentAfrica:factor(year)2002 ***
## continentAmericas:factor(year)2002
## continentAsia:factor(year)2002
## continentEurope:factor(year)2002
## continentOceania:factor(year)2002
## continentAfrica:factor(year)2007 ***
## continentAmericas:factor(year)2007
## continentAsia:factor(year)2007
## continentEurope:factor(year)2007
## continentOceania:factor(year)2007
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 44.3151)
##
## Null deviance: 284148 on 1703 degrees of freedom
## Residual deviance: 72765 on 1642 degrees of freedom
## AIC: 11359
##
## Number of Fisher Scoring iterations: 2
x1*x2 produces the cross of x1 and x2, or x1+x2+x1:x2
summary(glm(formula = lifeExp ~ gdpPercap + pop + continent*factor(year),
family=gaussian, data=gapminder))
##
## Call:
## glm(formula = lifeExp ~ gdpPercap + pop + continent * factor(year),
## family = gaussian, data = gapminder)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -29.5073 -3.4687 0.1739 3.5145 21.1851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.871e+01 9.235e-01 41.916 < 2e-16
## gdpPercap 3.356e-04 1.997e-05 16.805 < 2e-16
## pop 8.980e-10 1.586e-09 0.566 0.571275
## continentAmericas 1.319e+01 1.621e+00 8.134 8.09e-16
## continentAsia 5.822e+00 1.485e+00 3.920 9.22e-05
## continentEurope 2.379e+01 1.529e+00 15.557 < 2e-16
## continentOceania 2.708e+01 4.800e+00 5.642 1.98e-08
## factor(year)1957 2.086e+00 1.306e+00 1.598 0.110304
## factor(year)1962 4.067e+00 1.306e+00 3.115 0.001871
## factor(year)1967 5.930e+00 1.306e+00 4.542 5.99e-06
## factor(year)1972 7.948e+00 1.306e+00 6.087 1.43e-09
## factor(year)1977 9.994e+00 1.306e+00 7.653 3.32e-14
## factor(year)1982 1.204e+01 1.306e+00 9.221 < 2e-16
## factor(year)1987 1.386e+01 1.306e+00 10.613 < 2e-16
## factor(year)1992 1.414e+01 1.306e+00 10.830 < 2e-16
## factor(year)1997 1.408e+01 1.306e+00 10.779 < 2e-16
## factor(year)2002 1.373e+01 1.306e+00 10.511 < 2e-16
## factor(year)2007 1.504e+01 1.306e+00 11.515 < 2e-16
## continentAmericas:factor(year)1957 4.129e-01 2.291e+00 0.180 0.857023
## continentAsia:factor(year)1957 7.150e-01 2.095e+00 0.341 0.732979
## continentEurope:factor(year)1957 -2.288e-01 2.159e+00 -0.106 0.915582
## continentOceania:factor(year)1957 -1.483e+00 6.784e+00 -0.219 0.826997
## continentAmericas:factor(year)1962 7.727e-01 2.291e+00 0.337 0.735962
## continentAsia:factor(year)1962 9.945e-01 2.095e+00 0.475 0.635119
## continentEurope:factor(year)1962 -8.452e-01 2.159e+00 -0.391 0.695499
## continentOceania:factor(year)1962 -3.043e+00 6.784e+00 -0.449 0.653798
## continentAmericas:factor(year)1967 6.632e-01 2.291e+00 0.289 0.772261
## continentAsia:factor(year)1967 2.145e+00 2.095e+00 1.024 0.306043
## continentEurope:factor(year)1967 -2.107e+00 2.160e+00 -0.976 0.329424
## continentOceania:factor(year)1967 -5.285e+00 6.784e+00 -0.779 0.436082
## continentAmericas:factor(year)1972 3.507e-01 2.291e+00 0.153 0.878376
## continentAsia:factor(year)1972 2.032e+00 2.096e+00 0.969 0.332439
## continentEurope:factor(year)1972 -3.873e+00 2.161e+00 -1.792 0.073375
## continentOceania:factor(year)1972 -7.349e+00 6.785e+00 -1.083 0.278856
## continentAmericas:factor(year)1977 1.084e-02 2.292e+00 0.005 0.996226
## continentAsia:factor(year)1977 2.404e+00 2.096e+00 1.147 0.251547
## continentEurope:factor(year)1977 -5.362e+00 2.163e+00 -2.478 0.013293
## continentOceania:factor(year)1977 -8.742e+00 6.785e+00 -1.288 0.197779
## continentAmericas:factor(year)1982 -2.520e-01 2.292e+00 -0.110 0.912450
## continentAsia:factor(year)1982 3.479e+00 2.096e+00 1.660 0.097163
## continentEurope:factor(year)1982 -6.988e+00 2.165e+00 -3.227 0.001276
## continentOceania:factor(year)1982 -9.780e+00 6.785e+00 -1.441 0.149674
## continentAmericas:factor(year)1987 -3.056e-01 2.292e+00 -0.133 0.893940
## continentAsia:factor(year)1987 3.829e+00 2.096e+00 1.827 0.067953
## continentEurope:factor(year)1987 -8.505e+00 2.169e+00 -3.922 9.14e-05
## continentOceania:factor(year)1987 -1.120e+01 6.786e+00 -1.651 0.098952
## continentAmericas:factor(year)1992 8.020e-01 2.292e+00 0.350 0.726462
## continentAsia:factor(year)1992 4.878e+00 2.097e+00 2.326 0.020134
## continentEurope:factor(year)1992 -7.940e+00 2.168e+00 -3.662 0.000258
## continentOceania:factor(year)1992 -1.001e+01 6.786e+00 -1.475 0.140318
## continentAmericas:factor(year)1997 2.164e+00 2.292e+00 0.944 0.345341
## continentAsia:factor(year)1997 6.019e+00 2.098e+00 2.869 0.004174
## continentEurope:factor(year)1997 -7.487e+00 2.172e+00 -3.446 0.000582
## continentOceania:factor(year)1997 -9.753e+00 6.788e+00 -1.437 0.150990
## continentAmericas:factor(year)2002 3.649e+00 2.293e+00 1.591 0.111701
## continentAsia:factor(year)2002 7.461e+00 2.099e+00 3.555 0.000388
## continentEurope:factor(year)2002 -6.827e+00 2.178e+00 -3.134 0.001753
## continentOceania:factor(year)2002 -8.833e+00 6.791e+00 -1.301 0.193515
## continentAmericas:factor(year)2007 2.942e+00 2.294e+00 1.283 0.199720
## continentAsia:factor(year)2007 6.864e+00 2.101e+00 3.267 0.001108
## continentEurope:factor(year)2007 -8.316e+00 2.187e+00 -3.803 0.000148
## continentOceania:factor(year)2007 -1.013e+01 6.793e+00 -1.492 0.135983
##
## (Intercept) ***
## gdpPercap ***
## pop
## continentAmericas ***
## continentAsia ***
## continentEurope ***
## continentOceania ***
## factor(year)1957
## factor(year)1962 **
## factor(year)1967 ***
## factor(year)1972 ***
## factor(year)1977 ***
## factor(year)1982 ***
## factor(year)1987 ***
## factor(year)1992 ***
## factor(year)1997 ***
## factor(year)2002 ***
## factor(year)2007 ***
## continentAmericas:factor(year)1957
## continentAsia:factor(year)1957
## continentEurope:factor(year)1957
## continentOceania:factor(year)1957
## continentAmericas:factor(year)1962
## continentAsia:factor(year)1962
## continentEurope:factor(year)1962
## continentOceania:factor(year)1962
## continentAmericas:factor(year)1967
## continentAsia:factor(year)1967
## continentEurope:factor(year)1967
## continentOceania:factor(year)1967
## continentAmericas:factor(year)1972
## continentAsia:factor(year)1972
## continentEurope:factor(year)1972 .
## continentOceania:factor(year)1972
## continentAmericas:factor(year)1977
## continentAsia:factor(year)1977
## continentEurope:factor(year)1977 *
## continentOceania:factor(year)1977
## continentAmericas:factor(year)1982
## continentAsia:factor(year)1982 .
## continentEurope:factor(year)1982 **
## continentOceania:factor(year)1982
## continentAmericas:factor(year)1987
## continentAsia:factor(year)1987 .
## continentEurope:factor(year)1987 ***
## continentOceania:factor(year)1987 .
## continentAmericas:factor(year)1992
## continentAsia:factor(year)1992 *
## continentEurope:factor(year)1992 ***
## continentOceania:factor(year)1992
## continentAmericas:factor(year)1997
## continentAsia:factor(year)1997 **
## continentEurope:factor(year)1997 ***
## continentOceania:factor(year)1997
## continentAmericas:factor(year)2002
## continentAsia:factor(year)2002 ***
## continentEurope:factor(year)2002 **
## continentOceania:factor(year)2002
## continentAmericas:factor(year)2007
## continentAsia:factor(year)2007 **
## continentEurope:factor(year)2007 ***
## continentOceania:factor(year)2007
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 44.3151)
##
## Null deviance: 284148 on 1703 degrees of freedom
## Residual deviance: 72765 on 1642 degrees of freedom
## AIC: 11359
##
## Number of Fisher Scoring iterations: 2
The package lmtest has most of what you'll need to run basic regression diagnostics.
Breusch-Pagan Test for Heteroscedasticity
bptest(reg)
##
## studentized Breusch-Pagan test
##
## data: reg
## BP = 174.08, df = 7, p-value < 2.2e-16
bgtest(reg)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: reg
## LM test = 1241.4, df = 1, p-value < 2.2e-16
dwtest(reg)
##
## Durbin-Watson test
##
## data: reg
## DW = 0.29392, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
coeftest(x=reg, vcov.=vcovHC)
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.1846e+02 2.5634e+01 -20.2249 < 2.2e-16 ***
## gdpPercap 2.9849e-04 4.8328e-05 6.1763 6.561e-10 ***
## pop 1.7906e-09 1.3057e-09 1.3714 0.1703
## continentAmericas 1.4292e+01 5.2624e-01 27.1588 < 2.2e-16 ***
## continentAsia 9.3755e+00 5.6555e-01 16.5776 < 2.2e-16 ***
## continentEurope 1.9361e+01 6.9851e-01 27.7177 < 2.2e-16 ***
## continentOceania 2.0559e+01 1.0739e+00 19.1441 < 2.2e-16 ***
## year 2.8626e-01 1.3013e-02 21.9982 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lifeExp for each continentgap_median_lifeExp <- gapminder %>%
group_by(continent) %>%
summarize(med_lifeExp = median(lifeExp))
gapminder_pop_continent <- gapminder %>%
group_by(continent) %>%
mutate(continent_pop = sum(pop))
gdpPercap_diff that contains the difference between the observation's gdpPercap and the mean gdpPercap of the continent in that year. Arrange the dataframe by the column you just created, in descending order (so that the relatively richest country/years are listed first)hint: You might have to ungoup() before you arrange().
gap_rel_gdp <- gapminder %>%
group_by(continent, year) %>%
mutate(gdpPercap_diff = gdpPercap - mean(gdpPercap)) %>%
ungroup() %>%
arrange(desc(gdpPercap_diff))
country, year, and gdpPercap_diff columns. Use tidyr put it in wide format so that countries are rows and years are columns. gap_wide <- gap_rel_gdp %>%
select(country, year, gdpPercap_diff) %>%
spread(year, gdpPercap_diff)
gdpPercap and the explanatory variables are pop, lifeExp, and year. In one model, treat year as a numeric variable. In the other, factorize the year variable. How do you interpret each model?reg1 <- glm(formula = gdpPercap ~ lifeExp + pop + year,
family=gaussian, data=gapminder)
reg2 <- glm(formula = gdpPercap ~ lifeExp + pop + factor(year),
family=gaussian, data=gapminder)
gdpPercap_diff is positive or negative – that is, whether an observation is in the upper half or lower half of the continent's wealth in a given year. The explanatory variables should be country, lifeExp, and pop. gap_rel_gdp$sign <- 0
gap_rel_gdp$sign[gap_rel_gdp$gdpPercap_diff>0] <- 1
reg_logit <- glm(formula = sign ~ lifeExp + pop + country,
family=binomial(link = "logit"), data=gap_rel_gdp)